Search CORE

6 research outputs found

Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi

Author: Bali Kalika, Kumar Ritesh, Lahiri Bornini, Ojha Atul Kr., Raj Mohit, Ratan Shyam, Seshadri Vivek, Singh Siddharth, Sinha Sonal
Publication date: 26/06/2022

In this paper we discuss in-progress work on the development of a speech corpus for four low-resource Indo-Aryan languages (Awadhi, Bhojpuri, Braj and Magahi) using the field methods of linguistic data collection. The corpus currently totals approximately 18 hours (approx. 4-5 hours per language) and is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal Dependencies relations. We discuss our methodology for data collection in these languages, most of which was done in the middle of the COVID-19 pandemic, with one of the aims being to generate some additional income for low-income groups speaking these languages. We also discuss the results of baseline experiments with automatic speech recognition systems for these languages.
Comment: Speech for Social Good Workshop, 2022, Interspeech 202

arXiv.org e-Print Archive

Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign

Author: Ali Ahmed, Glass James, Grondelaers Stefan, Jain Mayank, Kumar Ritesh, Lahiri Bornini, Ljubešić Nikola, Malmasi Shervin, Nakov Preslav, Oostdijk Nelleke, Samardžić Tanja, Scherrer Yves, Shon Suwon, Speelman Dirk, Tiedemann Jörg, van den Bosch Antal, van der Lee Chris, Zampieri Marcos
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2018

We present the results and findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, co-located with COLING 2018. This year, the campaign comprised five shared tasks: two task re-runs, Arabic Dialect Identification (ADI) and German Dialect Identification (GDI), and three new tasks, namely Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.
Non peer reviewe

Radboud Repository

Helsingin yliopiston digitaalinen arkisto

Tilburg University Repository

Language identification and morphosyntactic tagging: The second VarDial evaluation campaign

Author: Ali Ahmed, Glass James, Grondelaers Stefan, Jain Mayank, Kumar Ritesh, Lahiri Bornini, Ljubešić Nikola, Malmasi Shervin, Nakov Preslav, Oostdijk Nelleke, Samardžić Tanja, Scherrer Yves, Shon Suwon, Speelman Dirk, Tiedemann Jörg, van den Bosch Antal, van der Lee Chris, Zampieri Marcos
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 20/08/2018